Load required packages:

library(RColorBrewer) ## nicer color schemes
library(xcms)         ## the package doing the job
library(tidyverse)    ## potentially useful 
library(knitr)
library(plotly)
library(FactoMineR)
library(factoextra)
library(effectsize)

We will analyze a subset of the data from UPLC-QTof MS untargeted analysis of Vitis vinifera L. leaves, collected in Italy and Germany from two fungus-resistant grape varieties: Regent and Phoenix. For this tutorial, we will focus on the quest of metabolic “biomarkers” for the geographical origin within Phoenix samples,
The raw data files (both in mzML and CDF format) are available from the MetaboLights repository. To speed up processing of this exercise we will restrict the analysis to the following subset of study files:

  • QC:

  • L011_QC_RP_pos01.CDF

  • L003_QC_RP_pos01.CDF

  • L018_QC_RP_pos01.CDF

  • Italy Phoenix:

  • L015_010_LPIi_RP_pos01.CDF

  • L006_002_LPIh_RP_pos01.CDF

  • L026_019_LPIc_RP_pos01.CDF

  • Germany Phoenix:

  • L023_017_LPGc_RP_pos01.CDF

  • L037_029_LPGe_RP_pos01.CDF

  • L047_037_LPGa_RP_pos01.CDF

You have to download them (in .CDF format) and save them on your PC in the same directory as this .Rmd file. Each file contains data in centroid mode acquired in positive ESI mode over a mass range from 50 to 2000 Da by using a chromatography of 28 minutes.

Note A large part of this tutorial is taken from the official vignette of xcms. Many thanks to Steffen Neumann and Johannes Rainer!

Data Loading

We start getting the raw data into R:

## Get the full path to the CDF files
cdfs <- list.files("../data/cdfs_grape/", full.names = TRUE)
cdfs <- cdfs[grepl("CDF", cdfs)]
cdfs
## [1] "../data/cdfs_grape//L003_QC_RP_pos01.CDF"       "../data/cdfs_grape//L006_002_LPIh_RP_pos01.CDF"
## [3] "../data/cdfs_grape//L011_QC_RP_pos01.CDF"       "../data/cdfs_grape//L015_010_LPIi_RP_pos01.CDF"
## [5] "../data/cdfs_grape//L018_QC_RP_pos01.CDF"       "../data/cdfs_grape//L023_017_LPGc_RP_pos01.CDF"
## [7] "../data/cdfs_grape//L026_019_LPIc_RP_pos01.CDF" "../data/cdfs_grape//L037_029_LPGe_RP_pos01.CDF"
## [9] "../data/cdfs_grape//L047_037_LPGa_RP_pos01.CDF"

In the last few years the xcms developers have been making a big effort to make their package coherent with a general framework for the handling of MS data in R (metabolomics, proteomics …). All this goes beyond the scope of our course, for us is sufficient to know that this infrastructure allows to store sample “metadata” (e.g. treatment class, time point, etc) together with the raw experimental data.

In our specific case, the data frame with the phenotipic data could be designed as follow:

phenodata <- tibble(matrix(ncol = 5, nrow = length(cdfs)))
colnames(phenodata) <- c("filename", "type", "country", "variety", "class")
phenodata$filename <- cdfs
phenodata$type <- "leaves"
phenodata$type[grep("QC", cdfs)] <- "QC"
phenodata$country <- "QC"
phenodata$country[grep("I", cdfs)] <- "italy"
phenodata$country[grep("G", cdfs)] <- "germany"
phenodata$variety <- "QC"
phenodata$variety[grep("LP", cdfs)] <- "phoenix"
phenodata$class <- paste(phenodata$country, phenodata$variety, sep = "_")
phenodata$class <- gsub("QC_QC", "QC", phenodata$class)

phenodata
## # A tibble: 9 × 5
##   filename                                       type   country variety class          
##   <chr>                                          <chr>  <chr>   <chr>   <chr>          
## 1 ../data/cdfs_grape//L003_QC_RP_pos01.CDF       QC     QC      QC      QC             
## 2 ../data/cdfs_grape//L006_002_LPIh_RP_pos01.CDF leaves italy   phoenix italy_phoenix  
## 3 ../data/cdfs_grape//L011_QC_RP_pos01.CDF       QC     QC      QC      QC             
## 4 ../data/cdfs_grape//L015_010_LPIi_RP_pos01.CDF leaves italy   phoenix italy_phoenix  
## 5 ../data/cdfs_grape//L018_QC_RP_pos01.CDF       QC     QC      QC      QC             
## 6 ../data/cdfs_grape//L023_017_LPGc_RP_pos01.CDF leaves germany phoenix germany_phoenix
## 7 ../data/cdfs_grape//L026_019_LPIc_RP_pos01.CDF leaves italy   phoenix italy_phoenix  
## 8 ../data/cdfs_grape//L037_029_LPGe_RP_pos01.CDF leaves germany phoenix germany_phoenix
## 9 ../data/cdfs_grape//L047_037_LPGa_RP_pos01.CDF leaves germany phoenix germany_phoenix

Below we restrict the phenodata matrix to our samples of interest (in this case, for the tutorial part QC samples and all samples from the variety Phoenix):

Up to now nothing has been actually loaded into R. To do that:

## As you can see the data frame with the phenotypic data is included inside the object holding the raw data
raw_data <- readMSData(
  files = phenodata$filename, 
  pdata = new("NAnnotatedDataFrame", 
              phenodata), ## this is the structure of xcms holding phenotypic data
  mode = "onDisk")  ## with this parameter the data are not loaded into RAM

Loading the full dataset into RAM can be problematic for large studies so with this specific on disk mode the raw data are still staying on the disk.

We next restrict the data set to the retention time range from 800 to 1000 seconds and to the mass-to-charge ratio range from 50 to 1000, just to save some processing time …

## These two xcms functions are used to subset the raw data
raw_data <- filterRt(raw_data, c(800, 1000))
raw_data <- filterMz(raw_data, c(50, 1000))

Data Visualization

It is important that along the process one can be able to visualize the raw data, so let’s give a look to the structure of the R object we created.

The raw_data object contains the full set of 3D data collected in all our samples. The “raw” values can be extracted by using:

rt <- rtime(raw_data)
mz <- mz(raw_data)
I <- intensity(raw_data)

Let’s look to the structure of these three objects:

glimpse(rt)
##  Named num [1:2512] 800 802 803 803 804 ...
##  - attr(*, "names")= chr [1:2512] "F1.S1120" "F1.S1121" "F1.S1122" "F1.S1123" ...

These are the retention times in seconds for the chromatography of all loaded files. For example, F1.S1120 stands for File1, scan number 1120 and it was recorded at 800 seconds.

Another way to see that:

plot(rt)

## The full number of scans for the set of 9 files
length(rt)
## [1] 2512

Here the individual lines highlight the increasing time scale for each file.

mz and I holds the mass spectra collected at each scantime … for this reason the two objects are lists and not vectors. Remember our data are 3D. For each scantime we have a complete mass spectrum.

## only the first 20
glimpse(mz[1:10])
## List of 10
##  $ F1.S1120: num [1:3416] 54.7 54.9 55 55.1 59.3 ...
##  $ F1.S1121: num [1:3366] 50.7 52.1 53.1 53.8 55 ...
##  $ F1.S1122: num [1:3333] 51.9 53.5 53.8 55 55.1 ...
##  $ F1.S1123: num [1:3364] 50.5 54.8 55 60 63.7 ...
##  $ F1.S1124: num [1:3387] 50.5 53 53.9 55 56.9 ...
##  $ F1.S1125: num [1:3404] 50 50.7 53.4 56.9 57.7 ...
##  $ F1.S1126: num [1:3424] 60.2 61.5 62.4 63.5 63.7 ...
##  $ F1.S1127: num [1:3356] 50.2 50.5 52.7 55.1 56 ...
##  $ F1.S1128: num [1:3368] 50.5 54.4 55 55 56.9 ...
##  $ F1.S1129: num [1:3483] 51.3 54.5 55.1 55.5 55.7 ...
## only the first 20
glimpse(I[1:10])
## List of 10
##  $ F1.S1120: num [1:3416] 1.01 1.01 1.01 1.01 1.01 ...
##  $ F1.S1121: num [1:3366] 1.01 1.01 1.01 1.01 1.01 ...
##  $ F1.S1122: num [1:3333] 1.01 1.01 1.01 1.01 1.01 ...
##  $ F1.S1123: num [1:3364] 1.01 1.01 2.03 1.01 1.01 ...
##  $ F1.S1124: num [1:3387] 1.01 1.01 1.01 1.01 1.01 ...
##  $ F1.S1125: num [1:3404] 1.01 1.01 1.01 1.01 1.01 ...
##  $ F1.S1126: num [1:3424] 1.01 1.01 1.01 1.01 1.01 ...
##  $ F1.S1127: num [1:3356] 1.01 1.01 1.01 1.01 1.01 ...
##  $ F1.S1128: num [1:3368] 1.01 1.01 1.01 1.01 1.01 ...
##  $ F1.S1129: num [1:3483] 1.01 1.01 1.01 1.01 1.01 ...

We can plot a complete spectrum (here the first scan of the first sample …):

plot(mz[[1]], I[[1]], type = "h", 
     main = names(mz)[1], xlab = "m/z", ylab = "intensity")

xcms provides tools to visualize and play with the raw data, in a more direct way:

## the size of our raw data
length(raw_data)
## [1] 2512

is exactly the number of scans, and:

raw_data[[1]]
## Object of class "Spectrum1"
##  Retention time: 13:20 
##  MSn level: 1 
##  Total ion count: 3416 
##  Polarity: -1

is indeed a spectrum, which has a plot method:

plot(raw_data[[1]])

This is a sort of gg stuff so if we want an interactive stuff we can rely on the plotly package, and also change some of the characteristics of the graphical layout:

ggplotly((plot(raw_data[[1]])))

Ok, working with all files together is not the best … for visualization and handling. To facilitate the “cutting” by file, xcms is provided with a split() function which can be combined with a fromFile function to create a list with the content separate by file:

single_raw <- split(raw_data, fromFile(raw_data))

and each element of the list is now a single raw data:

single_raw[[1]]
## MSn experiment data ("OnDiskMSnExp")
## Object size in memory: 0.13 Mb
## - - - Spectra data - - -
##  MS level(s): 1 
##  Number of spectra: 279 
##  MSn retention times: 13:20 - 16:39 minutes
## - - - Processing information - - -
## Data loaded [Thu Nov  4 11:12:08 2021] 
## Filter: select retention time [800-1000] and MS level(s), 1 [Thu Nov  4 11:12:08 2021] 
## Filter: trim MZ [50..1000] on MS level(s) 1. 
##  MSnbase version: 2.18.0 
## - - - Lazy processing queue content  - - -
##  o  filterMz 
## - - - Meta data  - - -
## phenoData
##   rowNames: 1
##   varLabels: filename type ... class (5 total)
##   varMetadata: labelDescription
## Loaded from:
##   L003_QC_RP_pos01.CDF 
## protocolData: none
## featureData
##   featureNames: F1.S1120 F1.S1121 ... F1.S1398 (279 total)
##   fvarLabels: fileIdx spIdx ... spectrum (35 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'

As we discussed, metabolites are visible as peaks in 3D mz/rt/intensity plane …

mytibble <- tibble(rt = rtime(single_raw[[1]]), 
                   mz = mz(single_raw[[1]]), 
                   I = intensity(single_raw[[1]]))
mytibble
## # A tibble: 279 × 3
##       rt mz            I            
##    <dbl> <named list>  <named list> 
##  1  800. <dbl [3,416]> <dbl [3,416]>
##  2  802. <dbl [3,366]> <dbl [3,366]>
##  3  803. <dbl [3,333]> <dbl [3,333]>
##  4  803. <dbl [3,364]> <dbl [3,364]>
##  5  804. <dbl [3,387]> <dbl [3,387]>
##  6  805. <dbl [3,404]> <dbl [3,404]>
##  7  805. <dbl [3,424]> <dbl [3,424]>
##  8  806. <dbl [3,356]> <dbl [3,356]>
##  9  807. <dbl [3,368]> <dbl [3,368]>
## 10  807. <dbl [3,483]> <dbl [3,483]>
## # … with 269 more rows

And now a fancy plot ….

jet.colors <- colorRampPalette(
  c("#00007F", "blue", "#007FFF", "cyan", "#7FFF7F", 
    "yellow", "#FF7F00", "red", "#7F0000"))
mapplot <- mytibble %>% 
  unnest(c("mz","I")) %>% 
  ggplot() + 
  geom_point(aes(x = rt, y = mz, col = sqrt(I)), size = 0.5) + 
  scale_color_gradientn(colours = jet.colors(7)) + 
  theme_light()
mapplot + 
  xlim(920, 970) + 
  ylim(412, 420)

  1. A mass spectrum can be seen as a vertical cut of the previous map
  2. We see the accuracy of the mass and also the phenomenon used by CentWave
  3. Some of the “peaks” are organized in vertical groups, these are the ions coming from the same metabolite
  4. The vertical gaps are associated to the lockmass scans!

The second thing we would like to visualize is one extracted ion trace:

## Define the rt and m/z range of the peak area
rtr <- 15.6*60
mzr <- 413.1207
## extract the chromatogram
chr_raw <- chromatogram(raw_data, 
                        mz = mzr + 0.1*c(-1, 1), 
                        rt = rtr + 30*c(-1, 1))
plot(chr_raw)

So we are able to actually see the chromatographic peak of the m/z 413.

Data Inspection

When one is dealing with the initial investigation of the data the first thing to do is to look to the total ion current of each chromatogram or to the base peak ion chromatogram:

## Get the total ion chromatograms. This reads data from all files.
tics <- chromatogram(raw_data, aggregationFun = "sum")
## Define colors for the two groups:
group_colors <- paste0(brewer.pal(3, "Set1"), "60")
names(group_colors) <- c("QC", "italy", "germany")
## Plot all chromatograms:
plot(tics, col = group_colors[raw_data$country])

## raw_data$country extracts the info on the phenotipic data inside the raw_data

As you can see, the results are different with a “common” look and feel. The large majority of peaks is there, even if the intensities are not always comparable. Red traces (QCs) are highly reproducible, speaking of an overall good reproducibility of the analytical pipeline.

The overall integral of the signal in each sample is often used as a way to spot gross analytical drifts

## here we rely on the old (and efficient) R style
total_I <- sapply(tics, function(x) sum(intensity(x)))
plot(total_I, col = gsub("60", "", group_colors[raw_data$country]), pch = 19, cex = 2)

We clearly see the overall difference between blue and green samples. This already indicate that the two classes of samples are strongly different. This is not really important at this point, but one should keep this in mind when we will perform the statistical analysis. In general, an important thing to look at here is the TIC of QCs: if the analytical performance of the sistem is optimal, they should be comparable.

A second and often less general quality check of the experimental data rely on the inspection of the chromatographic trace of one (or more) known compounds. In the case at hand, we know that a specific metabolite present in my samples yields an ion @mz413… let’s look to the profile of this ion signal over the chromatographic time:

## here we get the traces ... compare the function with the one used for the TICs
ion_I_know <- chromatogram(raw_data, mz = mzr + 0.1*c(-1, 1))
plot(ion_I_know, col = group_colors[raw_data$country])

The previous plot is important: it is telling us that the metabolite_I_know is present in the samples and is released by the chromatographic column at around 935 sec … There it is producing a peak in the signal of the ion_I_know @mz413

To automatically find metabolites in my data I have to teach to the computer to look for peaks in the chromatographic traces of all possible ions.

Peak Picking: one sample

The “older” and most sounding way of finding peaks implemented in xcms is the matched filter algorithm.

A full description of the parameters of the algorithm can be found in the xcms manual, here we focus on:

  • binSize: the “width” of the m/z bins used to find the peaks
  • fwhm: the “expected” size of the peak
  • snthresh: the signal/to noise ratio of the peak

In xcms the parameters of the algorithm are stored into a specific object:

mf <- MatchedFilterParam(binSize = 0.1, 
                         fwhm = 6, 
                         snthresh = 6) 
mf
## Object of class:  MatchedFilterParam 
##  Parameters:
##  - binSize: [1] 0.1
##  - impute: [1] "none"
##  - baseValue: numeric(0)
##  - distance: numeric(0)
##  - fwhm: [1] 6
##  - sigma: [1] 2.547987
##  - max: [1] 5
##  - snthresh: [1] 6
##  - steps: [1] 2
##  - mzdiff: [1] 0.6
##  - index: [1] FALSE

Now I can use the previous parameters to find the peaks in one sample:

first_peaks <- findChromPeaks(single_raw[[1]], param = mf)

The actual list of peaks can be extracted from the previous object by the method chromPeaks.

Let’s look to the head of the output:

first_peak_table <- chromPeaks(first_peaks) 
dim(first_peak_table)
## [1] 383  13
head(first_peak_table, 5)
##              mz     mzmin     mzmax      rt   rtmin   rtmax      into      intf     maxo     maxf i        sn sample
## CP001  83.04975  83.04828  83.05143 935.382 932.817 939.117 100.61855  86.77055 20.15974 21.96211 1  6.129351      1
## CP002  85.02940  85.02786  85.03004 813.489 809.747 816.054 283.34467 238.70072 62.62814 56.30592 1 10.705693      1
## CP003  93.07115  93.07047  93.07191 996.838 991.832 999.395 330.82682 326.61803 54.80862 62.14951 1  9.224595      1
## CP004  99.08113  99.07874  99.08565 996.838 993.104 999.395  72.73965  89.41898 18.42368 20.41894 1  9.315135      1
## CP005 105.03462 105.02673 105.03739 874.465 871.899 877.025  35.55708  37.69402 10.12658 11.77455 1  7.586727      1

The first two numbers are telling us that with the setting we have been choosing we were able to find 383 peaks.

The help of xcms describes the most relevant columns of the table:

“mz” (intensity-weighted mean of mz values of the peak across scans/retention times), “mzmin” (minimal mz value), “mzmax” (maximal mz value), “rt” (retention time of the peak apex), “rtmin” (minimal retention time), “rtmax” (maximal retention time), “into” (integrated, original, intensity of the peak), “maxo” (maximum intensity of the peak), “sample” (sample index in which the peak was identified)

This is the map of the identified peaks

as_tibble(first_peak_table) %>% 
  ggplot() + 
  geom_point(aes(x = rt, y = mz), size = 2, alpha =1, pch = 21, fill = "orange") + 
  theme_light()

  • we have a lot of peaks!
  • in some cases peaks are arranged in vertical stripes, these are probably the signatures of a metabolite … which is producing a different set of ions when it is introduced in the ionization interface.

Peak picking can also be performed with another algorithm: CentWave

cwp <- CentWaveParam(peakwidth = c(5, 30), 
                     ppm = 30,
                     prefilter = c(3, 100))
cwp
## Object of class:  CentWaveParam 
##  Parameters:
##  - ppm: [1] 30
##  - peakwidth: [1]  5 30
##  - snthresh: [1] 10
##  - prefilter: [1]   3 100
##  - mzCenterFun: [1] "wMean"
##  - integrate: [1] 1
##  - mzdiff: [1] -0.001
##  - fitgauss: [1] FALSE
##  - noise: [1] 0
##  - verboseColumns: [1] FALSE
##  - roiList: list()
##  - firstBaselineCheck: [1] TRUE
##  - roiScales: numeric(0)
##  - extendLengthMSW: [1] FALSE

Also here many parameters (and others are not mentioned). I highlight here some of them:

  • peakwidth: this is the expected range of the width of the chromatographic peaks. In this case from 5 to 30 seconds
  • ppm: this is the expected mass shift of the signal of a “true” ion due to electric noise
  • prefilter: this is an initial filter which will consider valid only ion traces which are preserving a signal of more than 100 counts for at least 5 scans.

If we run the peak picking with this new algorithm…

first_peaks_cw <- findChromPeaks(single_raw[[1]], param = cwp)
first_peak_table_cw <- chromPeaks(first_peaks_cw) 
dim(first_peak_table_cw)
## [1] 152  11
head(first_peak_table_cw, 5)
##             mz    mzmin    mzmax      rt   rtmin   rtmax      into      intb     maxo  sn sample
## CP001 243.0677 243.0672 243.0690 802.710 800.254 807.822  711.4057  686.0903 164.0446  43      1
## CP002 139.0407 139.0399 139.0412 802.710 800.254 813.489 1817.9908 1647.9536 218.6978  17      1
## CP003 380.1419 380.1408 380.1432 802.710 800.254 805.907 1272.7966 1266.3360 307.1760 146      1
## CP004 380.1417 380.1408 380.1431 807.187 805.907 813.489  775.9329  767.5927 168.5718  81      1
## CP005 169.0518 169.0503 169.0533 803.349 800.254 810.387 1035.3783 1020.4388 133.5483  71      1

As you see the number of columns is different, but the key infos are there.

as_tibble(first_peak_table_cw) %>% 
  ggplot() + 
  geom_point(aes(x = rt, y = mz), size = 2, alpha =1, pch = 21, fill = "orange") + 
  theme_light()

If we superimpose them…

t_mf <- as_tibble(first_peak_table)


as_tibble(first_peak_table_cw) %>% 
  ggplot() + 
  geom_point(aes(x = rt, y = mz), size = 2, alpha =1, pch = 21, fill = "orange") + 
  geom_point(data = t_mf, mapping = aes(x = rt, y = mz), size = 2, alpha =0.5, col = "steelblue") + 
  theme_light()

The difference is striking!

Obviously one could fiddle around with the parameters to look for a more coherent picture, but the difference is not unexpected considering the fact that we are dealing with two different approaches.

Peak Picking: all the dataset

When we are satisfied with a set of peak picking parameters, the algorithm will be sequentially run on all the files of the dataset resulting in a large list of peaks assigned to the different samples.

xdata <- findChromPeaks(raw_data, param = cwp)

Here a table of the peaks found in all files:

table(chromPeaks(xdata)[, "sample"])
## 
##   1   2   3   4   5   6   7   8   9 
## 152 164 152 159 149 113 190 108  86

An overall representation of their distribution in the plane is extremely interesting:

chromPeaks(xdata) %>% 
  as_tibble() %>% 
  ggplot() + 
  geom_point(aes(x = rt, y = mz, col = into), size = 0.3) + 
  facet_wrap(~sample) + 
  theme_light()

As you can see the samples are different, but the overall “look and feel” is coherent. This is telling us that the overall analytical run was good.

Regarding the retention time shifts…

chromPeaks(xdata) %>% 
  as_tibble() %>% 
  filter(sample %in% c(1, 8)) %>% 
  ggplot() + 
  geom_point(aes(x = rt, y = mz, col = factor(sample)), siaze = 0.3) + 
  scale_color_brewer(palette = "Set1") + 
  theme_light()

From the plot we can see a small but visible shift in RT. The shift is responsible of a difference in the samples coming from the analysis and not the biology and it has to be corrected to avoid biased results.

xcms can do much more to browse and characterize the peaks, but here we want to focus on the key ideas.

In summary:

  • the list are always different
  • they are different even if we analyze the same sample twice
  • they are different for analytical/instrumental reasons
  • they are different for biological reasons.

Alignment

The alignment step, also referred to as retention time correction, aims at adjusting this by shifting signals along the retention time axis to align the signals between different samples within an experiment.

Also here a plethora of approaches is available. As usual, everything will work better if the chormatography is more reproducible (for GC, for example, retention time correction is often not necessary).

In xcms the most used and reliable method for alignment of high resolution experiments is based on the obiwarp approach. The algorithm was developed for proteomics and is based on dynamic time warping.

The alignment is performed directly on the profile-matrix and can hence be performed independently of the peak detection or peak grouping.

If the samples look quiet different among them, it might be helpful to perform the alignment based on only QC samples (or another subset of samples) and use these to adjust the full data set.

xdata <- adjustRtime(xdata, param = ObiwarpParam(
  binSize = 0.2,
  subset = which(xdata$type == "QC"),
  subsetAdjust = "average"))
  • binSize set the width of the slices of the m/z bins used to extract the traced which are then aligned

It is of utmost importance to check the amount of correction since large time shifts are not reasonable. Below we plot the BPC before and after applying the RT correction, as well as the differences of the adjusted- to the raw retention times per sample.

chr_raw <- chromatogram(xdata, aggregationFun = "max", 
                        adjustedRtime = FALSE)
chr_adj <- chromatogram(xdata, aggregationFun = "max")
par(mfrow = c(3, 1), mar = c(2, 4.3, 2, 0.5))
plot(chr_raw, peakType = "none", main = "BPC, raw", 
     col = group_colors[xdata$country])
plot(chr_adj, peakType = "none", main = "BPC, adjusted", 
     col = group_colors[xdata$country])
plotAdjustedRtime(xdata, col = group_colors[xdata$country])

As you can see the correction is never bigger than 2 seconds. With a chromatographic peak width of around 10 seconds this is more than acceptable and, another time it speaks of a overall good analytical reproducibility.

xdata now still contains the list of the peaks for the different samples, but now they retention time should be less erratic…

chromPeaks(xdata) %>% 
  as_tibble() %>% 
  filter(sample %in% c(1,8)) %>% 
  ggplot() + 
  geom_point(aes(x = rt, y = mz, col = factor(sample)), siaze = 0.3) + 
  scale_color_brewer(palette = "Set1") + 
  theme_light()

As you can see the situation has improved and some of the vertical stripes are now well aligned.

Correspondence

The last step is to find a consensus list of variables across the different samples. These will be the features which will be used for the data analysis. The list of peaks is now aligned in retention time, but:

  • peaks are still separated per sample
  • a peak could be present only in a group of samples (because a metabolite is missing there)
  • a peak could be missing because it was not correctly identified

The common way of doing this step in xcms relies in a density based approach.

The algorithm combines chromatographic peaks depending on the density of peaks along the retention time axis (all peaks found in all samples together!) within small slices along the m/z dimension.

Care should be taken to account for the fact that a peak could be absent in a sample or in a set of samples and to avoid, in the meantime, to keep peaks found only in one sample.

As before, the parameters of this step are included in a specific object

pdp <- PeakDensityParam(sampleGroups = xdata$country, 
                        minFraction = 0.5, 
                        bw = 30,
                        binSize = 0.1)

A set of peaks will be considered a candidate to become a “valid” group if it contains peaks coming from at least a minFraction of samples belonging to one of the sampleGroups.

An example will make this more clear. Suppose I have a dataset with two sample groups: one of 4 samples, the other of 6. If I set minFraction to 0.5, a group of peaks will be considered a feature if it contains at least:

  • peaks coming from 2 samples of the first group
  • peaks coming from 3 samples of the second group

… or more.

  • binsize: set the width in the m/z dimension to collect peaks from the different peaklists
  • bw: this is the bandwidth of the density estimate used to estimate the distribution of the peaks in the retention time dimension

Grouping is finally performed with:

xdata <- groupChromPeaks(xdata, param = pdp)

The features are now the variables which will show-up in the data matrix. Their definition has been added by the groupChromPeaks method to the xdata object (which also contains the definition of the peaks of the different samples).

The definition can be extracted as a dataframe:

myfeatures <- featureDefinitions(xdata)
head(myfeatures, 5)
## DataFrame with 5 rows and 12 columns
##           mzmed     mzmin     mzmax     rtmed     rtmin     rtmax    npeaks   germany     italy        QC            peakidx  ms_level
##       <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>             <list> <integer>
## FT001   95.0865   95.0862    95.087   849.097   848.471   850.046         3         0         3         0        180,511,926         1
## FT002  107.0871  107.0869   107.087   996.557   996.238   996.876         2         0         2         0           289,1050         1
## FT003  119.0872  119.0866   119.088   849.737   848.471   850.402         5         0         3         2     38,189,500,...         1
## FT004  121.0664  121.0663   121.066   996.877   996.876   996.878         2         0         2         0           290,1051         1
## FT005  121.1025  121.1021   121.103   996.876   996.238   996.876         4         2         2         0  291, 871,1052,...         1

The table contains:

  • the definition of the “position” of the feature in the mz/rt plane (mzmed,mzmin,mzmax,rtmed,rtmin,rtmax)
  • the number of peaks which were assigned to that feature npeaks
  • the number of samples (per group) which have peaks that have been joined in each feature
  • the index of the peaks grouped in each feature

The (almost) final untargeted data matrix can be extracted from the same object with:

DM <- featureValues(xdata, value = "into")
dim(DM)
## [1] 135   9

The intensity used to build the data matrix is normally chosen as:

  • into: integrated, original, intensity of the peak
  • maxo: maximum intensity of the peak

In our simple example we have 135 variables measured over 9 samples:

head(DM)
##       L003_QC_RP_pos01.CDF L006_002_LPIh_RP_pos01.CDF L011_QC_RP_pos01.CDF L015_010_LPIi_RP_pos01.CDF L018_QC_RP_pos01.CDF
## FT001                   NA                   874.6886                   NA                   1221.790                   NA
## FT002                   NA                   949.0719                   NA                         NA                   NA
## FT003             845.8922                  1398.7975                   NA                   1202.988             828.6412
## FT004                   NA                   730.9143                   NA                         NA                   NA
## FT005                   NA                  1468.2392                   NA                         NA                   NA
## FT006                   NA                  1261.4381                   NA                   1393.185                   NA
##       L023_017_LPGc_RP_pos01.CDF L026_019_LPIc_RP_pos01.CDF L037_029_LPGe_RP_pos01.CDF L047_037_LPGa_RP_pos01.CDF
## FT001                         NA                   973.9651                         NA                         NA
## FT002                         NA                  1061.6101                         NA                         NA
## FT003                         NA                  1283.7827                         NA                         NA
## FT004                         NA                   744.8442                         NA                         NA
## FT005                   651.2681                  1516.9884                   706.7103                         NA
## FT006                         NA                  1452.1044                         NA                         NA

Note that DM holds samples in columns and variables in rows, so it should be transposed to be ready for the standard analysis.

Handling NAs

So we finally get there, we have our data matrix full of intensities, but (as usual) missing values are not absent…

Another time:

  • in some cases NAs are there because that feature was not present (low concentration metabolites)
  • in other cases they are there because the long chain of steps we have made could have been leaking somewhere. Maybe one peak was showing a bad shape, or two peaks were not well separated…

To go on we have to try to fill at least a part of the holes with a reasonable number. The function fillChromPeaks() integrates the signal found in the mz-rt region of the feature. The mz and rt ranges used by the algorithm are defined as the lower quartile of the mzmin and rtmin values of all detected peaks of the feature, and the upper quartile of the mzmax and rtmax values.

  • if the peak was missing for an error in the preprocessing, it will somehow be recovered by this procedure (obviously as far as the samples are aligned…)
  • if nothing is there, the algorithm will find electrical/chemical noise… and this number will be a reasonable estimate of the signal we get when something is undetectable.

This smart approach (which works well in many cases, even if there are exceptions) is implemented in xcms with the fillChromPeaks function:

xdata_filled <- fillChromPeaks(xdata)

Now our filled data matrix looks like this…

DM_f <- featureValues(xdata_filled, value = "into")
head(DM_f)
##       L003_QC_RP_pos01.CDF L006_002_LPIh_RP_pos01.CDF L011_QC_RP_pos01.CDF L015_010_LPIi_RP_pos01.CDF L018_QC_RP_pos01.CDF
## FT001             364.3755                   874.6886             437.0850                  1221.7900             407.8034
## FT002             311.3094                   949.0719             416.2391                   579.1548             343.0307
## FT003             845.8922                  1398.7975             428.3805                  1202.9881             828.6412
## FT004             198.9592                   730.9143             320.2578                   355.7047             285.6761
## FT005             852.2550                  1468.2392             688.0599                  1259.5293             780.4506
## FT006             561.6849                  1261.4381             360.7264                  1393.1849             531.4599
##       L023_017_LPGc_RP_pos01.CDF L026_019_LPIc_RP_pos01.CDF L037_029_LPGe_RP_pos01.CDF L047_037_LPGa_RP_pos01.CDF
## FT001                   163.3833                   973.9651                   240.3923                   83.61931
## FT002                   212.3647                  1061.6101                   106.3215                   24.90527
## FT003                   175.6653                  1283.7827                   265.0664                  130.04801
## FT004                   171.3388                   744.8442                   175.6182                   95.46605
## FT005                   651.2681                  1516.9884                   706.7103                  404.26093
## FT006                   206.0327                  1452.1044                   221.3511                  111.78935

A clear improvement, isn’t it ? :-)

This data matrix will be the starting point of our statistical analysis…

Multivariate Visualization - PCA

The subsequent step is to take a look at the multivariate structure of the data to spot larger scale patterns.
Remember that at the beginning of the tutorial we have filtered the data to the RT range 800-1000 seconds to be faster, but for this part of the session we are interested in keeping the information from all chromatograms. For that, we already have processed this data and now we are just going to upload it in the workspace.

In order to be able to use the xdata stored in the file, the raw CDF should be organized as follows:

–Folder_name | – data | |– cdfs_grape | |– phoenix.Rdata, *.CDF – Grape_analysis |– analyze_the_grape.Rmd

If you put the raw data and the RData in a different folder you have to update the xdata@processingData@files putting thre the right path pointing to the correct files

To check the actual path to the raw data stored in the xdata you can use

fileNames(xdata)
## [1] "/home/franceschp/Documents/Training/Metabolomics_lectures/data/cdfs_grape/L003_QC_RP_pos01.CDF"      
## [2] "/home/franceschp/Documents/Training/Metabolomics_lectures/data/cdfs_grape/L006_002_LPIh_RP_pos01.CDF"
## [3] "/home/franceschp/Documents/Training/Metabolomics_lectures/data/cdfs_grape/L011_QC_RP_pos01.CDF"      
## [4] "/home/franceschp/Documents/Training/Metabolomics_lectures/data/cdfs_grape/L015_010_LPIi_RP_pos01.CDF"
## [5] "/home/franceschp/Documents/Training/Metabolomics_lectures/data/cdfs_grape/L018_QC_RP_pos01.CDF"      
## [6] "/home/franceschp/Documents/Training/Metabolomics_lectures/data/cdfs_grape/L023_017_LPGc_RP_pos01.CDF"
## [7] "/home/franceschp/Documents/Training/Metabolomics_lectures/data/cdfs_grape/L026_019_LPIc_RP_pos01.CDF"
## [8] "/home/franceschp/Documents/Training/Metabolomics_lectures/data/cdfs_grape/L037_029_LPGe_RP_pos01.CDF"
## [9] "/home/franceschp/Documents/Training/Metabolomics_lectures/data/cdfs_grape/L047_037_LPGa_RP_pos01.CDF"
load("phoenix.RData")
DM_f <- featureValues(xdata, value = "into")

In order to do a first PCA with the scope of getting a general overview of the data it is necessary to impute the data matrix, replacing NAs with meaningful numbers (it is important to keep in mind that even having applied the fillChromPeaks() function, there may still be some missing values in the data matrix):

  1. The number should have a variability
  2. The number should not know anything about the experimental design
  3. The number should have a reasonable analytical meaning

My choice here is to work variable wise and replace the NAs with a random number drawn from an uniform distribution spanning from 0 to half of the minimum value measured for that variable

The rationale behind this choice is that:

  • a concentration is a positive number
  • everything I cannot measure is equally likely

First of all I need a function to perform the imputation of a vector

myimputer <- function(v){
  if (sum(is.na(v)) == 0) {
    return(v)
  } else {
    napos <- which(is.na(v))
    newval <- runif(length(napos),0, min(v, na.rm = TRUE)/2)
    out <- v
    out[napos] <- newval
    return(out)
  }
}

Now we apply it to the full set of columns

DM_f <- t(DM_f)
set.seed(123)
DM_i <- apply(DM_f, 2, myimputer)

Below we apply log-transformation since usually the distribution of metabolomics data intensities is far from being normal. Additionally, in this case example, the number of samples for each group is extremely low.

DM_i <- log10(DM_i)

And now PCA!

myPCA <- PCA(DM_i, graph = FALSE)
fviz_pca_ind(myPCA, 
             habillage = factor(xdata$country), 
             geom = "point", 
             pointsize = 2,
             axes = c(1,2)) + 
  scale_color_brewer(palette = "Set1")

The amount of variability captured in this representation is already the 70%, and the plot shows the presence of a clear separation between sample classes.

What about the variables?

fviz_pca_biplot(myPCA, 
                habillage = factor(xdata$country), 
                geom = "point", 
                pointsize = 2,
                axes = c(1,2)) + 
  scale_color_brewer(palette = "Set1")

The biplot here seems to suggest that Italian samples probably show higher levels for the large majority of signals. Even if the biplot here is not really useful to spot clera patterns …

Biomarker Discovery

What we did so far is called exploratory data analysis. To write the paper we should also find the variables which are showing a significant effect of the design factors.

Notes

  1. One could look for biomarkers with univariate and multivariate methods.
  2. In a univariate perspective, to avoid the risks of non matching the assumptions of many statistical tests it would be safer to rely on non-parametric approaches.
  3. Unfortunately, non parametric approaches are not really useful with datasets with small number of samples … here we have 3! So to go on with our investigation we have to rely on parametric tests. In doing so we should always remember that we are making strong assumptions, so no guarantee (as usual ;-)) that our results will be valid for the population ….
  4. Since we are performing multiple tests, we should remember that false positives will be present!

We will now employ the parametric test (the Student's t-Test) using in this case the log-transformed values.

class <- xdata$country[xdata$type != "QC"]
stats.tt <- function(x){t.test(x ~ class)$p.value}
pval_tt <- apply(DM_i[xdata$class != "QC",], 2, stats.tt)
hist(pval_tt, breaks = 20, xlim = c(0, 1), col = rgb(0, 0, 1, 1/4))

If there would be no difference the distribution of p-values is expected to be uniform … there is a clear enrichment of small p-values and it speaks of the presence of a significant fraction of differences between samples from Italy and Germany. This result is not unexpected considering the large difference observed in the PCA

random_DM <- DM_i[xdata$class != "QC",] 
random_DM <- random_DM[sample(1:6),]

pval_tt_rnd <- apply(random_DM, 2, stats.tt)


hist(pval_tt, breaks = 20, xlim = c(0, 1), col=rgb(0, 0, 1, 1/4), 
     main = "Correctight and random labels")
hist(pval_tt_rnd, breaks = 20, xlim = c(0, 1), col=rgb(1, 0, 0, 1/4), add = T)

so far we know that the two classes are different, but to start with the biomarker annotation job we would need to decide a criterion to prioritize a subset of the features …

A sensible choice would be to rank the variables on the base of their contrast between the two classes… so we would be interested on focusing on those features which, in addition to being statistically significant, show high differences between their respective mean values. The presence of an “high” contrast between the two classess can be measured in terms of effect size. Specifically we’ll use the Cohen's d.

stats.cd <- function(x){cohens_d(x ~ class)$Cohens_d}
cd <- apply(DM_i[xdata$class != "QC",], 2, stats.cd)

The Cohens’d and the p-value can be represented in a volcano plot

plot(cd,1-pval_tt, xlab = "Cohen's D", main = "Volcano Plot", pch = 3)

The crosses in the upper extremes corresponds to features with high statistical significance and large effect size

Now we join all the outputs in a unique data matrix and we sort the results by effect size:

res <- cbind(
  as.data.frame(featureDefinitions(xdata))[,c("mzmed", "rtmed")], pval_tt, cd)
res$rtmed <- res$rtmed/60
res <- res[order(res[,"cd"]),]
head(res)
##          mzmed     rtmed      pval_tt        cd
## FT182 225.1504 13.518290 2.345346e-06 -57.67963
## FT581 443.1593  6.582005 1.123973e-05 -37.97001
## FT021 119.0869 20.403369 1.478472e-05 -21.71849
## FT337 324.1072  9.109826 5.815651e-04 -21.68161
## FT016 114.0560  1.464615 7.269864e-04 -21.04946
## FT431 369.1030  8.225872 2.196857e-04 -18.27667
tail(res)
##          mzmed    rtmed      pval_tt       cd
## FT363 337.0194 17.15362 3.565600e-04 11.23554
## FT305 319.0102 16.36280 1.246203e-03 12.46449
## FT392 349.1273 10.35626 1.022902e-03 12.50758
## FT726 596.1504 20.24369 1.602160e-04 15.47363
## FT427 365.1217 15.39883 5.004988e-05 15.97732
## FT282 309.0423 10.84316 1.248875e-05 34.06363

This is the list of “biomarkers”, let’s plot one of them …

sample_colors <- group_colors[xdata$country]
set.seed(123)
ptx <- as.numeric(factor(xdata$class)) + runif(length(xdata$class), -0.2, 0.2)
plot(ptx, DM_f[,rownames(res)[1]], ylab = "intensity", 
     col = gsub("60", "", sample_colors), pch = 19, xaxt = "n",
     ylim = c(0, max(DM_f[,rownames(res)[1]])),
     main = rownames(res)[1])
axis(1, seq(3), levels(factor(xdata$class)))

Which exactly shows what we expect.

Furthermore, we can also manually check its peak shape, as well as the integrated area. Within xcms there is the function featureChromatograms() which allows to visualize the EIC of an specific feature, displaying the area integrated by the algorithm:

ft_chr1 <- featureChromatograms(xdata, features = rownames(res)[1], 
                               expandRt = 15, filled = FALSE)
ft_chr2 <- featureChromatograms(xdata, features = rownames(res)[1], 
                               expandRt = 15, filled = TRUE)
par(mfrow = c(1, 2))
plot(ft_chr1, col = group_colors[xdata$country],
     peakBg = sample_colors[chromPeaks(ft_chr1)[, "sample"]])
plot(ft_chr2, col = group_colors[xdata$country],
     peakBg = sample_colors[chromPeaks(ft_chr2)[, "sample"]])
legend("topright", col = gsub("60", "", group_colors), 
       legend = names(group_colors), pch = 19)

Another way to check the feature of interest is using the function plotChromPeakDensity():

chr_mzr <- chromatogram(xdata, mz = res$mzmed[1] + 0.01 * c(-1, 1))
plotChromPeakDensity(chr_mzr, col = group_colors, param = pdp,
                     peakBg = sample_colors[chromPeaks(chr_mzr)[, "sample"]],
                     peakCol = sample_colors[chromPeaks(chr_mzr)[, "sample"]],
                     peakPch = 16)

The upper panel in the plot shows the EIC with the detected peaks highlighted, whereas the lower plot indicates which individual peaks were detected, as well as how they have been grouped within features (indicated with grey rectangles). The black line represents the density distribution of detected peaks along the retention times.
In this example we can see that there have been detected 2 features with this m/z value at different retention times. The plot tells us that our feature of interest (eluted around 800 seconds) has initially been detected only in Italian (blue) samples. Although the plot shows that this feature was not detected in QC neither in samples from Germany, with the boxplot we have seen that there is also a value for these samples. Note that this value has obtained during the application of the function fillChromPeaks(). This is clearly seen when we use the function featureChromatograms() and playing with the argument filled.